6 - Clustering

Author

CDN team

Published

Last compiled on 18 November, 2024

Introduction

Clustering is the process of grouping cells based on their gene expression profiles and takes place after dimensionality reduction. For scRNA-seq data we use community detection algorithms: we compute a k-nearest-neighbor graph and then group cells based on their shared nearest neighbors. To obtain fine-grained annotations, clustering should be done iteratively so that the least amount of information is lost or misidentified.

In this notebook, we learn:

  • Clustering in Seurat requires the computation of a K-nearest-neighbor graph followed by the identification of clusters in this graph using the functions FindNeighbors() and FindClusters().

  • The k.param parameter in FindNeighbors() determines the connectivity of the graph. A higher k yields a more highly connected graph, which results in fewer clusters.

  • The resolution parameter in FindClusters() controls the granularity of the clustering. A higher resolution results in more clusters.

  • Clustering results can be evaluated quantitatively by looking at the sample and batch distribution across clusters, using the Simpson diversity index, and a silhouette analysis of the clusters.

Review

We left off last session having filtered out poor-quality data without regard for its distribution. Last week, we learned how principal components are calculated, what a latent space is, and defined a kNN (k-nearest neighbors) graph.

To review those:

We take a cell x gene matrix, normalize it, and reduce its dimensions using PCA. We looked at the elbow plot to determine the optimal number of PCs to capture the variance. After selecting the top 30 PCs, we generated a k-nearest neighbors (kNN) graph to represent the relationships between cells based on their gene expression profiles and ran FindClusters() to identify clusters of cells based on their neighborhood relationships. Kyle introduced Harmony as an integration technique to correct for batch effects, and I went over QC metrics, including doublet detection, that help us understand the quality and content of our data.

Glossary

PCA - A dimensionality reduction technique that reduces the number of dimensions in a dataset while retaining as much variance as possible. The first principal component captures the axis of most variance in the data; each subsequent component captures the next orthogonal axis that accounts for the most remaining variance.

Latent Space - The latent space is the low-dimensional representation of the data that captures the most important features.

kNN Graph - A graph that represents the relationships between cells based on their gene expression profiles. Each cell is connected to its k nearest neighbors in the graph (e.g., k = 10, 20, or 100).

SNN Graph - A graph whose edge weights represent how many neighbors are shared between two cells. This makes the graph more robust to outliers.
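As a minimal sketch of that weighting (base R only, with made-up neighbor sets), the weight of an SNN edge between two cells can be computed as the Jaccard index of their k-nearest-neighbor sets:

```r
# Jaccard index: shared neighbors / total distinct neighbors.
# The neighbor sets below are hypothetical, for illustration only.
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

nn_cell1 <- c(2, 3, 4, 5)    # k-nearest neighbors of cell 1
nn_cell2 <- c(3, 4, 5, 6)    # k-nearest neighbors of cell 2
jaccard(nn_cell1, nn_cell2)  # 3 shared / 5 distinct = 0.6
```

Edges between cells that share few neighbors get low weights and can be pruned, which is what makes the SNN graph robust to outliers.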

Resources

  1. Challenges in unsupervised clustering of single-cell RNA-seq data
  2. Current best practices in single-cell RNA-seq analysis: a tutorial
  3. Bioconductor
  4. Single Cell Best Practices

Load Libraries and Data

if (!requireNamespace("dplyr", quietly = TRUE))
    install.packages('dplyr')
if (!requireNamespace("tidyverse", quietly = TRUE))
    install.packages('tidyverse')
if (!requireNamespace("Seurat", quietly = TRUE))
    install.packages('Seurat')
if (!requireNamespace("colorBlindness", quietly = TRUE))
    install.packages('colorBlindness')
if (!requireNamespace("RColorBrewer", quietly = TRUE))
    install.packages('RColorBrewer')
if (!requireNamespace("cluster", quietly = TRUE))
    install.packages('cluster')
if (!requireNamespace("viridis", quietly = TRUE))
    install.packages('viridis')
if (!requireNamespace("ggplot2", quietly = TRUE))
    install.packages('ggplot2')
if (!requireNamespace("ggalluvial", quietly = TRUE))
    install.packages("ggalluvial")
if (!requireNamespace("tidygraph", quietly = TRUE))
    install.packages("tidygraph")
if (!requireNamespace("ggraph", quietly = TRUE))
    install.packages("ggraph")
if (!requireNamespace("DoubletFinder", quietly = TRUE)) # GitHub-only package
    remotes::install_github("chris-mcginnis-ucsf/DoubletFinder")
library(dplyr)
library(Seurat)
library(tidyverse)
library(RColorBrewer)
library(colorBlindness)
library(DoubletFinder)
library(cluster)
library(viridis)
library(ggplot2)
library(ggalluvial)
library(tidygraph)
library(ggraph)

set.seed(687)

Load Data

# Load the Seurat object with doublet and batch information
se <- readRDS('../data/Covid_Flu_Seurat_Object_Quality.rds')
se 
An object of class Seurat 
33234 features across 59572 samples within 1 assay 
Active assay: RNA (33234 features, 3000 variable features)
 3 layers present: counts, data, scale.data
 2 dimensional reductions calculated: pca, umap
# Set color palette for cell types
pal <- paletteMartin
names(pal) <- sort(unique(se$Celltype))

donor_pal <- c(
    "#66C2A4", "#41AE76", "#238B45", "#006D2C",
    "#41B6C4", "#1D91C0", "#225EA8", "#253494", "#081D58",
    "#FFEDA0", "#FED976", "#FEB24C", "#f79736", "#FD8D3C",
    "#FC4E2A", "#E31A1C", "#BD0026", "#ad393b", "#800000", "#800050")

names(donor_pal) <- c(
    "Normal 1", "Normal 2", "Normal 3", "Normal 4",
    "Flu 1", "Flu 2", "Flu 3", "Flu 4", "Flu 5",
    "nCoV 1", "nCoV 2", "nCoV 3", "nCoV 4", "nCoV 5",
    "nCoV 6", "nCoV 7", "nCoV 8", "nCoV 9", "nCoV 10", "nCoV 11"  
)

Set Up

PCA

Let’s first look at the data according to the top 2 principal components.

celltype_pca <- DimPlot(
        se,
        reduction = "pca",
        group.by = 'Celltype',
        cols = pal
        ) 

celltype_pca

Construct the kNN graph with FindNeighbors():

FindNeighbors() is the Seurat function that calculates the k-nearest neighbors of each cell in PCA space. The number of neighbors used for this calculation is controlled by the k.param parameter, with a default value of 20. It computes pairwise distances between cells using metrics such as Euclidean distance, Manhattan distance, Pearson correlation distance, or cosine similarity.
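As a toy sketch of the underlying idea (base R only, with illustrative coordinates; not Seurat's actual implementation), finding a cell's k nearest neighbors amounts to ranking pairwise distances in PCA space:

```r
# Four made-up cells in a 2-PC space
pcs <- matrix(c(0, 0,  1, 0,  0, 2,  5, 5), ncol = 2, byrow = TRUE,
              dimnames = list(paste0("cell", 1:4), c("PC_1", "PC_2")))

d <- as.matrix(dist(pcs))  # pairwise Euclidean distances
k <- 2
# The k cells closest to cell1 (excluding cell1 itself)
knn_cell1 <- names(sort(d["cell1", -1]))[1:k]
knn_cell1  # "cell2" "cell3"
```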

From BioStars:

FindNeighbors() is a function that is used to find the nearest neighbors of your single cell data point within a dataset. It works by calculating the neighborhood overlap (Jaccard index) between every cell and its k.param nearest neighbors. It’s often employed in various applications such as anomaly detection and dimensionality reduction. The concept is that given a data point, you want to identify the closest data points to it based on some similarity metric, such as Euclidean distance or cosine similarity. This helps to identify similar points in the dataset, which can be useful for making predictions or understanding the distribution of the data.

See ?FindNeighbors for its default values and more information. Let’s modify these to fit our analysis:

se <- se %>%
    FindNeighbors( 
        reduction = "pca",
        dims = 1:20,
        k.param = 20,
        verbose = FALSE
    )
# Look at the k-nearest neighbors (nn) and shared nearest neighbors (snn) graphs computed
se@graphs
$RNA_nn
A Graph object containing 59572 cells
$RNA_snn
A Graph object containing 59572 cells

FindClusters

FindClusters() is used to identify clusters of cells based on their neighborhood relationships, typically computed in PCA space. It takes as input the graph made by FindNeighbors() and outputs cluster assignments for each cell, found at se@meta.data$seurat_clusters.

The resolution parameter controls the granularity of the clustering. Higher values of resolution will result in more clusters, while lower values will result in fewer clusters. The default value is 0.8.

From BioStars:

FindClusters() is a function used for clustering data points into groups or clusters based on their similarity. It uses a graph-based clustering approach, with the Louvain algorithm as the default. Clustering is an unsupervised learning technique where the algorithm groups similar cells together without any predefined labels. The goal is to find patterns and structure in your data. The number of clusters and the algorithm used can vary based on the problem and data characteristics. Common clustering algorithms include k-means, hierarchical clustering, and DBSCAN.


The algorithm parameter selects the community detection method:

1 - original Louvain algorithm

2 - Louvain algorithm with multilevel refinement

3 - SLM (Smart Local Moving) algorithm

4 - Leiden algorithm

See ?FindClusters for more information.

Clustering

Clustering is a classical unsupervised machine learning task. Seurat identifies cell clusters with community detection algorithms: Louvain by default, with Leiden also available. A community detection algorithm identifies groups of nodes in a network that are densely connected. These algorithms find tightly connected communities in the shared nearest neighbor (SNN) graph computed in FindNeighbors(). SNN graphs are more robust to outliers than k-nearest-neighbor graphs because they weight edges by the number of neighbors shared with other cells. Leiden is a 2019 improvement on the Louvain algorithm, so it is common to see both in scRNA-seq analysis.
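As a rough sketch of what these algorithms optimize (base R, toy graph; the real implementations are far more elaborate), modularity compares the density of within-community edges against what a random graph with the same node degrees would give:

```r
# Adjacency matrix of a 4-node undirected graph with one bridge edge
A <- matrix(0, 4, 4)
A[1, 2] <- A[2, 1] <- 1  # community {1, 2}
A[3, 4] <- A[4, 3] <- 1  # community {3, 4}
A[2, 3] <- A[3, 2] <- 1  # bridge between the communities

membership <- c(1, 1, 2, 2)  # assumed community assignment
k <- rowSums(A)              # node degrees
m <- sum(A) / 2              # number of edges
same <- outer(membership, membership, "==")

# Modularity: Q = (1 / 2m) * sum_ij (A_ij - k_i * k_j / 2m) * delta(c_i, c_j)
Q <- sum((A - outer(k, k) / (2 * m)) * same) / (2 * m)
Q  # 1/6: positive, so communities are denser than expected by chance
```

Louvain and Leiden greedily move nodes between communities to increase this score rather than evaluating every possible partition.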

In clustering, the goal is not to pull apart as many clusters as possible; it is an iterative process. Especially in the first pass, you want to separate the main cell groups, such as epithelial cells and immune cells, so that you can refine the clusters and extract more granular cell types in the next iteration.

After clustering, we’ll review some cluster validation techniques to qualitatively and quantitatively check the quality of the clustering results.

se <- FindClusters(
      object = se,
      resolution = c(0.01, 0.05, 0.1, 0.15, 0.2, 0.25),
      algorithm = 1) 
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 59572
Number of edges: 2044941

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9958
Number of communities: 4
Elapsed time: 10 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 59572
Number of edges: 2044941

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9853
Number of communities: 9
Elapsed time: 12 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 59572
Number of edges: 2044941

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9765
Number of communities: 12
Elapsed time: 11 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 59572
Number of edges: 2044941

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9708
Number of communities: 16
Elapsed time: 11 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 59572
Number of edges: 2044941

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9661
Number of communities: 18
Elapsed time: 11 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 59572
Number of edges: 2044941

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9621
Number of communities: 18
Elapsed time: 13 seconds
# Results seen at RNA_snn_res in metadata
colnames(se@meta.data)[str_detect(colnames(se@meta.data), "RNA_snn_res")]
[1] "RNA_snn_res.0.01" "RNA_snn_res.0.05" "RNA_snn_res.0.1"  "RNA_snn_res.0.15" "RNA_snn_res.0.2"  "RNA_snn_res.0.25"

KNN vs SNN

Subset Seurat object to visualize KNN and SNN

se_sub <- se[, sample(colnames(se), 1000)] # Subset to 1000 cells

se_sub <- se_sub %>%
    FindNeighbors(
        reduction = "pca",
        dims = 1:30,
        k.param = 10,
        verbose = FALSE,
        return.neighbor = FALSE,
        compute.SNN = TRUE
    )

Now let’s do some processing to visualize the KNN and SNN graphs

# Extract PCA
pca_coords <- se_sub@reductions$pca@cell.embeddings[, 1:3] %>%
    as.data.frame() %>%
    rownames_to_column("name")

## Edges for KNN graph
edges_knn <- data.frame(se_sub@graphs$RNA_nn, check.names = FALSE) %>%
    tibble::rownames_to_column("source") %>%
    pivot_longer(cols = -source, names_to = "target") %>%
    dplyr::filter(source != target & value > 0) %>%
    dplyr::select(-value)

## Edges for SNN graph
edges_snn <- data.frame(se_sub@graphs$RNA_snn, check.names = FALSE) %>%
    tibble::rownames_to_column("source") %>%
    pivot_longer(cols = -source, names_to = "target", values_to = "jaccard") %>%
    # Prune the SNN: keep edges with Jaccard index > 1/7 (~0.14), a strict
    # threshold chosen for visualization to highlight robustness to outliers
    dplyr::filter(source != target & jaccard > 1 / 7) %>%
    dplyr::select(-jaccard)

# Create graph objects
graph_knn <- tbl_graph(
        edges = edges_knn,
        nodes = pca_coords,
        directed = FALSE) %>%
    activate(nodes)

graph_snn <- tbl_graph(
        edges = edges_snn,
        nodes = pca_coords,
        directed = FALSE) %>%
    activate(nodes)

Visualize KNN and SNN graph

# Visualize KNN graph
pknn12 <- ggraph(graph_knn, layout = 'manual', x = PC_1, y = PC_2) +
    geom_edge_link(alpha = 0.5) +
    geom_point(aes(x = PC_1, y = PC_2), color = "blue", size = 3) +
    theme_minimal() +
    labs(title = "KNN in PCA Space", x = "PC_1", y = "PC_2")

# Visualize SNN graph
psnn12 <- ggraph(graph_snn, layout = 'manual', x = PC_1, y = PC_2) +
    geom_edge_link(alpha = 0.5) +
    geom_point(aes(x = PC_1, y = PC_2), color = "orange", size = 3) +
    theme_minimal() +
    labs(title = "SNN in PCA Space", x = "PC_1", y = "PC_2")

pknn12 | psnn12

# Also look at PC1 vs PC3 for a different perspective
pknn13 <- ggraph(graph_knn, layout = 'manual', x = PC_1, y = PC_3) +
    geom_edge_link(alpha = 0.5) +
    geom_point(aes(x = PC_1, y = PC_3), color = "blue", size = 3) +
    theme_minimal() +
    labs(title = "KNN in PCA Space", x = "PC_1", y = "PC_3")

# Visualize SNN graph
psnn13 <- ggraph(graph_snn, layout = 'manual', x = PC_1, y = PC_3) +
    geom_edge_link(alpha = 0.5) +
    geom_point(aes(x = PC_1, y = PC_3), color = "orange", size = 3) +
    theme_minimal() +
    labs(title = "SNN in PCA Space", x = "PC_1", y = "PC_3")

pknn13 | psnn13

We can see that the SNN graph prunes connections between extreme points and therefore improves the robustness of our clustering.

RunUMAP

UMAP runs on the PCA space. The dims parameter specifies which dimensions of the PCA space to use for the UMAP calculation; here we use the first 30 PCs, matching the neighbor graph. The n.components parameter specifies the number of dimensions in the UMAP embedding; the default value is 2.

se <- se %>% 
    RunUMAP(dims = 1:30, verbose = FALSE) # Run UMAP 

Visualize Clusters

DimPlot(
    se,
    group.by = c(
        "RNA_snn_res.0.01", "RNA_snn_res.0.05",
        "RNA_snn_res.0.1", "RNA_snn_res.0.15",
        "RNA_snn_res.0.2", "RNA_snn_res.0.25"),
    label = TRUE
)

Community detection doesn’t follow a hierarchical clustering process. The algorithm splits the graph into more communities at higher resolutions rather than subclustering the major clusters present at lower resolutions. Clustering at a low resolution splits the dataset into a certain number of groups, while clustering at a high resolution draws different lines through that same data. Below we show how this method of clustering (community detection) differs from hierarchical clustering.

se@meta.data %>%
    dplyr::count(
        Celltype, RNA_snn_res.0.01, RNA_snn_res.0.05,
        RNA_snn_res.0.1, RNA_snn_res.0.25) %>% 
    ggplot(
        aes(axis1 = RNA_snn_res.0.01, axis2 = RNA_snn_res.0.05,
            axis3 = RNA_snn_res.0.1, axis4 = RNA_snn_res.0.25,
            y = n)) +
    geom_alluvium(aes(fill = RNA_snn_res.0.01), width = 0.1) +
    geom_stratum(width = 0.1) +
    geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
    scale_x_discrete(
        limits = c("RNA_snn_res.0.01", "RNA_snn_res.0.05",
                   "RNA_snn_res.0.1", "RNA_snn_res.0.25"),
        expand = c(0.15, 0.05)) +
    scale_fill_brewer(palette = "Dark2") +
    labs(title = "Compare Cluster Resolution of 0.01, 0.05, 0.1 and 0.25",
         x = "Clusters",
         y = "Count") +
    theme_minimal()

We see that in this case the clusters follow a fairly hierarchical splitting, although at higher resolutions some cells do branch out into other clusters.

Different Resolutions

0.05 vs 0.1 Resolution

DimPlot(
    se,
    group.by = c(
        "Celltype","RNA_snn_res.0.05", "RNA_snn_res.0.1"),
    label = TRUE
    )

Comparing the 0.05 and 0.1 resolutions, Cluster 0 at resolution 0.05 appears to split into Clusters 1 and 2 at resolution 0.1. While this seems helpful for sifting through the data, we’ve decided to keep the large blobs together and leave the finer splits to the second level of annotation. Resolution 0.05 has the best distribution, but let’s quantitatively analyze its cluster quality.

Cluster Metrics

Cluster Diversity

Splitting the clusters by cluster size and sample diversity lets us make sure no single sample is being clustered separately from the others.
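For intuition, Simpson's diversity index (computed below via vegan::diversity with index = 'simpson') is 1 minus the probability that two cells drawn from a cluster come from the same sample. A minimal sketch with made-up sample counts:

```r
# Simpson's diversity: 1 - sum(p_i^2), where p_i is the proportion of the
# cluster's cells contributed by sample i (toy counts, base R only)
simpson <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

simpson(c(50, 50))  # two samples contribute equally -> 0.5
simpson(c(99, 1))   # one sample dominates -> near 0
```

A cluster dominated by a single sample (like clusters 4 and 6 below) therefore scores near 0, flagging a possible sample-specific artifact.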

# seurat_clusters <- "RNA_snn_res.0.05"
diversityPerGroup <- function(df, species, group, diversity_metric = 'simpson') {
  # Convert strings to symbols for curly-curly operator
  species_sym <- rlang::sym(species)
  group_sym <- rlang::sym(group)
  # Count groups per species directly using curly-curly
  tierNcount <- df %>%
    group_by({{species_sym}}) %>%
    count({{group_sym}}, name = "n") %>% ungroup
  # Pivot table to allow vegan::diversity call
  tierNwide <- tierNcount %>%
    pivot_wider(names_from = {{group_sym}}, values_from = n, values_fill = list(n = 0))
  # Use rownames of species for the diversity function, which needs a dataframe
  tierNwide_df <- as.data.frame(tierNwide)
  # species column must be first
  tierNwide_df <- tierNwide_df %>% select(all_of(species), everything())
  rownames(tierNwide_df) <- tierNwide_df[, 1]
  tierNwide_df <- tierNwide_df[, -1]
  # Calculate diversity
  diversity_values <- vegan::diversity(tierNwide_df, index = diversity_metric)
  # Prepare result as a dataframe
  result <- data.frame(
    species = rownames(tierNwide_df),
    diversity = diversity_values,
    row.names = NULL
  )
  # Rename diversity column
  names(result)[1] <- species
  names(result)[2] <- sprintf('%s_diversity', group)
  return(result)
}

# Calculate simpson's diversity per cluster
clusterMetrics <- diversityPerGroup(se@meta.data,
                        species = 'RNA_snn_res.0.05',
                        group = 'Sample ID',
                        diversity_metric = 'simpson')

# Calculate number of cells per cluster and join to metrics table
clusterMetrics <- clusterMetrics %>% left_join(se@meta.data %>% count(RNA_snn_res.0.05))
head(clusterMetrics)
  RNA_snn_res.0.05 Sample ID_diversity     n
1                0         0.913441363 23573
2                1         0.923354092 17543
3                2         0.922488819  6591
4                3         0.875371149  3756
5                4         0.007659878  2865
6                5         0.621039497  2688

Let’s visualize Simpson’s diversity index by cluster

ggplot(clusterMetrics, aes(x = RNA_snn_res.0.05, y = n)) +
  geom_segment(aes(
          x = RNA_snn_res.0.05, xend = RNA_snn_res.0.05,
          y = 0, yend = n),
      linewidth = 1.5, color = 'grey80') + # Draw lines for lollipops
  geom_point(aes(color = `Sample ID_diversity`), size = 5) + # Add colored lollipop 'heads', coloring by 'Sample ID_diversity'
  scale_y_log10() +
  # scale_x_continuous(breaks = seq(0, 20)) + 
  scale_colour_viridis(
      option = "C",
      name = "Sample ID Diversity",
      direction = 1,
      limits = c(0, 1)) + # Colorblind-friendly, vibrant color palette
  theme_minimal(base_size = 10) +
  theme(legend.position = "bottom",
        axis.text = element_text(size = 12), 
        axis.title = element_text(size = 14), 
        title = element_text(size = 16)) +
  labs(x = "Clusters",
       y = "log(cluster size)",
       title = "Simpson Diversity Index per Cluster") +
  guides(colour = guide_colourbar(title.position = "top", title.hjust = 0.5))

Clusters 4 and 6 appear to have very low sample diversity. Let’s investigate which samples are in each cluster to see if there is any bias.

Plot sample distribution per cluster

# Prepare the data for plotting
plot_sample <- se@meta.data %>%
  count(RNA_snn_res.0.05, `Sample ID`) %>%
  group_by(RNA_snn_res.0.05) %>%
  mutate(proportion = n / sum(n)) %>%
  ungroup()

# Create a stacked bar plot
ggplot(
    plot_sample,
    aes(x = factor(RNA_snn_res.0.05),
        y = proportion,
        fill = `Sample ID`)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(
      title = "Distribution of Sample IDs Across Clusters",
      x = "Seurat Clusters",
      y = "Proportion",
      fill = "Sample ID") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_manual(values = donor_pal)

Sample Flu 1 appears to make up the majority of Cluster 4, and Flu 3 the majority of Cluster 6. Looking at it with a table -

table(se$`Sample ID`, se$RNA_snn_res.0.05)
          
              0    1    2    3    4    5    6    7    8
  Flu 1     241   39   30    4 2854 1619    0  100    8
  Flu 2     413  639  243   30    0    3    0    2    1
  Flu 3     811  141   20   51    4   36 1533    5    0
  Flu 4     395  442  157   20    0   19    5    1    1
  Flu 5     287   76    1    5    0   13    1  269    0
  nCoV 1   1035 1630  723  798    3   88    0  186    1
  nCoV 10   834 1572  554  188    0   79    0    8    4
  nCoV 11  1320 1971  630  453    3   41    0    2    5
  nCoV 2   3513  747  479  175    0   68    0   10    7
  nCoV 3    172  155   18   49    0    4    0    0    0
  nCoV 4    132  372  163   24    0   20    2    1    0
  nCoV 5   1962 1284  491  164    0   70    1    2    4
  nCoV 6   2257 1116  594  299    0  257    0    0    3
  nCoV 7    305  397   90  660    0   48    0    2    0
  nCoV 8    377  439  184  510    0   79    0  284    0
  nCoV 9    298  678  250  102    1   15    0    0    1
  Normal 1 2852  615  809   18    0   32    0    2    3
  Normal 2 2444 1701  393   40    0   45    0    2   21
  Normal 3 1959 1868  410  134    0   69    0   31   19
  Normal 4 1966 1661  352   32    0   83    0   11   18

Another way to visualize cluster diversity is through a heatmap-style table instead of lollipops. This provides a clear way to compare these numerical metrics side by side.

ggplot(clusterMetrics,
       aes(
           x = 1, 
           y = as.character(RNA_snn_res.0.05),
           fill = `Sample ID_diversity`,
           label = n)) +
  geom_tile(colour = "white") +
  geom_text(nudge_x = 1.5, size = 3) +
  geom_text(aes(label = signif(`Sample ID_diversity`, 2)),size = 3) +
  scale_fill_distiller(palette = "Spectral", limits = c(0,1)) + theme_classic() +
  coord_fixed(ratio = 0.25) + 
  expand_limits(x = c(0.5,3)) +
  labs(
      x = "Diversity                                   Size",
      y = "RNA_snn_res.0.05",
      fill = "Simpson Diversity\nfor Sample") +
  theme(axis.text.y = element_text(hjust = 1, vjust = 0.5, size = 12),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.title.y = element_text(size = 15),
        legend.key.size = unit(1, 'cm'),
        legend.title = element_text(size = 10), 
        legend.text = element_text(size = 10)
        )

Visualizing cluster diversity by batch is also important. As with the QC motivation, these plots provide a way to explain your data and possible findings before claiming any biological phenomena.

# Prepare the data for plotting
se@meta.data %>%
    count(RNA_snn_res.0.05, batch) %>%
    group_by(RNA_snn_res.0.05) %>%
    mutate(
        proportion = n / sum(n),
        batch = as.character(batch)) %>%
    ungroup() %>%
    # Create a stacked bar plot
    ggplot(
        aes(x = factor(RNA_snn_res.0.05),
            y = proportion,
            fill = batch)) +
    geom_bar(stat = "identity", position = "stack") +
    labs(x = "Seurat Clusters", y = "Proportion", fill = "Batch",
       title = "Distribution of Batches Across Clusters") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

Silhouette Analysis

As covered in our workshop, silhouette analysis measures how similar each cell is to the other cells in its own cluster compared to cells in other clusters. The silhouette value ranges from -1 to 1; a high value indicates that the cell is well matched to its own cluster and poorly matched to neighboring clusters, and a low value indicates the opposite.
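To make the definition concrete, here is a hand computation of one point's silhouette width on a toy 1-D dataset (base R only; cluster::silhouette, used below, does this for every cell):

```r
# Silhouette width of a point: (b - a) / max(a, b), where
#   a = mean distance to the other members of its own cluster
#   b = mean distance to the members of the nearest other cluster
x       <- c(0, 1, 10, 11)  # two tight groups on a line (toy data)
cluster <- c(1, 1, 2, 2)

a   <- mean(abs(x[1] - x[cluster == 1][-1]))  # within-cluster distance for point 1
b   <- mean(abs(x[1] - x[cluster == 2]))      # distance to the other cluster
sil <- (b - a) / max(a, b)
sil  # ~0.905: point 1 sits firmly inside its own cluster
```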

seurat_clusters <- as.character(se@meta.data$RNA_snn_res.0.05)

pca_embeddings <- Embeddings(se, reduction = 'pca')

# Calculate silhouette widths
sil_scores <- silhouette(x = as.numeric(seurat_clusters), dist = dist(pca_embeddings))

# Extract silhouette scores
silhouette_data <- as.data.frame(sil_scores)
head(silhouette_data)
  cluster neighbor sil_width
1       0        2 0.4385896
2       0        2 0.3529505
3       0        5 0.3056284
4       4        6 0.2638074
5       0        2 0.2766471
6       2        0 0.4487648
# recover cell barcodes as row names
row.names(silhouette_data) <- row.names(pca_embeddings)

silhouette_arranged <- silhouette_data %>% 
    mutate(cluster = as.character(cluster)) %>%
    group_by(cluster) %>% 
    arrange(-sil_width)

Visualize silhouette scores

overall_average <- silhouette_arranged %>% 
  ungroup %>% 
  summarize(ave = as.numeric(mean(sil_width))) %>% 
  pull(ave)

ggplot(
    silhouette_arranged, 
    aes(x = sil_width, 
        y = cluster, 
        fill = cluster, 
        group = cluster)) +
    geom_bar(stat = "identity", position = 'dodge2') +
    geom_vline(xintercept = overall_average,
               color = 'red',
               linetype = 'longdash') +
    theme_minimal() +
    labs(title = "Silhouette Analysis",
        y = "Cluster",
        x = "Silhouette width",
        fill = "Cluster") +
    theme(axis.text.y = element_text(hjust = 1, vjust = 0.5, size = 20),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.title.y = element_text(size = 20),
        legend.position = "None")

The dashed red line shows the average silhouette width across all clusters. Clusters 1 and 5 stand out here, and we can look at their distribution on a UMAP for more context.

So let’s look at silhouette scores on a UMAP

d5 <- DimPlot(se,
        reduction = 'umap',
        group.by = 'RNA_snn_res.0.05', 
        label = TRUE)

se$CellID <- rownames(se@meta.data)

sil_ids <- silhouette_data %>%
    rownames_to_column('CellID') %>%
    left_join(se@meta.data, by = 'CellID')
se <- AddMetaData(se, sil_ids)

FeaturePlot(
  se, 
  features = "sil_width") + 
  ggtitle('Silhouette width') + 
  scale_color_distiller(
      type = "div",
      palette = "BrBG") | d5

Cluster 5 seems to have a particularly poor silhouette score, which potentially means these cells are not clustered correctly. This could be due to under-clustering, with multiple cell types/states ending up within it. In the next notebook we will check for heterogeneity within cluster 5 and take a closer look at what might be going on.

Save Seurat object

saveRDS(se, file = "../data/clustered_se.rds")

Extra: More on Clustering Algorithms

The Louvain algorithm was developed in 2008 and is a popular community detection algorithm in scRNA-seq analysis. It recursively merges communities into single nodes and executes modularity clustering on the condensed graphs (Zhang). Both Seurat and scanpy use Louvain as the default clustering algorithm.


Leiden Algorithm

The Leiden algorithm was published in 2019 as an improvement on the Louvain algorithm. Leiden creates clusters by taking into account the number of links between cells in a cluster versus the overall expected number of links in the dataset. It computes a clustering on a kNN graph obtained from the PC-reduced expression space. It starts from an initial partition in which each node forms its own community. Next, the algorithm moves single nodes from one community to another to find a partition, which is then refined. Based on the refined partition, an aggregate network is generated, which is again refined until no further improvements can be obtained and the final partition is reached (Single Cell Best Practices).

There are several popular clustering algorithms, and there is no single correct way to cluster: clustering is a means of looking at the data from different angles. The most popular clustering algorithms are the Louvain algorithm, the Leiden algorithm, hierarchical clustering, and k-means clustering. Seurat uses the Louvain algorithm by default; the Leiden algorithm is an improvement on it.

Session Info

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.4

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggraph_2.1.0         tidygraph_1.3.1      ggalluvial_0.12.5    viridis_0.6.4        viridisLite_0.4.2    cluster_2.1.4        DoubletFinder_2.0.4  colorBlindness_0.1.9 RColorBrewer_1.1-3   lubridate_1.9.2      forcats_1.0.0        stringr_1.5.1        purrr_1.0.2          readr_2.1.4          tidyr_1.3.1          tibble_3.2.1         ggplot2_3.5.0        tidyverse_2.0.0      Seurat_5.1.0         SeuratObject_5.0.2   sp_2.1-3             dplyr_1.1.4         

loaded via a namespace (and not attached):
  [1] jsonlite_1.8.8         magrittr_2.0.3         spatstat.utils_3.0-4   farver_2.1.1           rmarkdown_2.26         vctrs_0.6.5            ROCR_1.0-11            spatstat.explore_3.2-6 htmltools_0.5.7        gridGraphics_0.5-1     sctransform_0.4.1      parallelly_1.37.0      KernSmooth_2.23-22     htmlwidgets_1.6.4      ica_1.0-3              plyr_1.8.9             plotly_4.10.4          zoo_1.8-12             igraph_2.0.2           mime_0.12              lifecycle_1.0.4        pkgconfig_2.0.3        Matrix_1.6-5           R6_2.5.1               fastmap_1.1.1          fitdistrplus_1.1-11    future_1.33.1          shiny_1.8.0            digest_0.6.34          colorspace_2.1-0       patchwork_1.2.0        tensor_1.5             RSpectra_0.16-1        irlba_2.3.5.1          vegan_2.6-4            labeling_0.4.3         progressr_0.14.0       timechange_0.2.0       fansi_1.0.6            spatstat.sparse_3.0-3  mgcv_1.9-0             httr_1.4.7             polyclip_1.10-6        abind_1.4-5            compiler_4.3.1         withr_3.0.0            fastDummies_1.7.3      ggforce_0.4.1          MASS_7.3-60            permute_0.9-7          tools_4.3.1           
 [52] lmtest_0.9-40          httpuv_1.6.14          future.apply_1.11.1    goftest_1.2-3          glue_1.8.0             nlme_3.1-163           promises_1.2.1         grid_4.3.1             Rtsne_0.17             reshape2_1.4.4         generics_0.1.3         gtable_0.3.4           spatstat.data_3.0-4    tzdb_0.4.0             hms_1.1.3              data.table_1.15.4      utf8_1.2.4             spatstat.geom_3.2-8    RcppAnnoy_0.0.22       ggrepel_0.9.5          RANN_2.6.1             pillar_1.9.0           spam_2.10-0            RcppHNSW_0.6.0         later_1.3.2            splines_4.3.1          tweenr_2.0.2           lattice_0.21-8         survival_3.5-7         deldir_2.0-2           tidyselect_1.2.0       miniUI_0.1.1.1         pbapply_1.7-2          knitr_1.45             gridExtra_2.3          scattermore_1.2        xfun_0.42              graphlayouts_1.0.0     matrixStats_1.2.0      stringi_1.8.3          lazyeval_0.2.2         yaml_2.3.8             evaluate_0.23          codetools_0.2-19       cli_3.6.3              uwot_0.1.16            xtable_1.8-4           reticulate_1.35.0.9000 munsell_0.5.0          Rcpp_1.0.12            globals_0.16.2        
[103] spatstat.random_3.2-2  png_0.1-8              parallel_4.3.1         ellipsis_0.3.2         dotCall64_1.1-1        listenv_0.9.1          scales_1.3.0           ggridges_0.5.6         leiden_0.4.3.1         rlang_1.1.3            cowplot_1.1.3